The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished smartphones. ReCell is a startup aiming to tap the potential in this market. We will analyze the data provided, build a linear regression model to predict the price of a used phone, and identify the factors that significantly influence it.
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# split the data into train and test
from sklearn.model_selection import train_test_split
# to build a linear regression model
from sklearn.linear_model import LinearRegression
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
data = pd.read_csv("used_phone_data.csv")
data.head(10)
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 23.97 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 2020 | 127 | 111.62 | 86.96 |
| 1 | Honor | Android | 28.10 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 2020 | 325 | 249.39 | 161.49 |
| 2 | Honor | Android | 24.29 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 2020 | 162 | 359.47 | 268.55 |
| 3 | Honor | Android | 26.04 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | 278.93 | 180.23 |
| 4 | Honor | Android | 15.72 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | 140.87 | 103.80 |
| 5 | Honor | Android | 21.43 | yes | no | 13.0 | 8.0 | 64.0 | 4.0 | 4000.0 | 176.0 | 2020 | 223 | 157.70 | 113.67 |
| 6 | Honor | Android | 19.84 | yes | no | 8.0 | 5.0 | 32.0 | 2.0 | 3020.0 | 144.0 | 2020 | 234 | 91.74 | 72.29 |
| 7 | Honor | Android | 18.57 | yes | no | 13.0 | 8.0 | 64.0 | 4.0 | 3400.0 | 164.0 | 2020 | 219 | 179.24 | 132.91 |
| 8 | Honor | Android | 15.72 | yes | no | 13.0 | 16.0 | 128.0 | 6.0 | 4000.0 | 165.0 | 2020 | 161 | 200.32 | 150.88 |
| 9 | Honor | Android | 21.43 | yes | no | 13.0 | 8.0 | 128.0 | 6.0 | 4000.0 | 176.0 | 2020 | 327 | 159.75 | 103.59 |
data.shape
(3571, 15)
data.columns
Index(['brand_name', 'os', 'screen_size', '4g', '5g', 'main_camera_mp',
'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
'release_year', 'days_used', 'new_price', 'used_price'],
dtype='object')
data.dtypes
brand_name object os object screen_size float64 4g object 5g object main_camera_mp float64 selfie_camera_mp float64 int_memory float64 ram float64 battery float64 weight float64 release_year int64 days_used int64 new_price float64 used_price float64 dtype: object
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3571 entries, 0 to 3570 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand_name 3571 non-null object 1 os 3571 non-null object 2 screen_size 3571 non-null float64 3 4g 3571 non-null object 4 5g 3571 non-null object 5 main_camera_mp 3391 non-null float64 6 selfie_camera_mp 3569 non-null float64 7 int_memory 3561 non-null float64 8 ram 3561 non-null float64 9 battery 3565 non-null float64 10 weight 3564 non-null float64 11 release_year 3571 non-null int64 12 days_used 3571 non-null int64 13 new_price 3571 non-null float64 14 used_price 3571 non-null float64 dtypes: float64(9), int64(2), object(4) memory usage: 362.7+ KB
data.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 180 selfie_camera_mp 2 int_memory 10 ram 10 battery 6 weight 7 release_year 0 days_used 0 new_price 0 used_price 0 dtype: int64
# Let's look at the statistical summary of the data
data.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3571 | 34 | Others | 509 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| os | 3571 | 4 | Android | 3246 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3571.0 | NaN | NaN | NaN | 14.803892 | 5.153092 | 2.7 | 12.7 | 13.49 | 16.51 | 46.36 |
| 4g | 3571 | 2 | yes | 2359 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5g | 3571 | 2 | no | 3419 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3391.0 | NaN | NaN | NaN | 9.400454 | 4.818396 | 0.08 | 5.0 | 8.0 | 13.0 | 48.0 |
| selfie_camera_mp | 3569.0 | NaN | NaN | NaN | 6.547352 | 6.879359 | 0.3 | 2.0 | 5.0 | 8.0 | 32.0 |
| int_memory | 3561.0 | NaN | NaN | NaN | 54.532607 | 84.696246 | 0.005 | 16.0 | 32.0 | 64.0 | 1024.0 |
| ram | 3561.0 | NaN | NaN | NaN | 4.056962 | 1.391844 | 0.03 | 4.0 | 4.0 | 4.0 | 16.0 |
| battery | 3565.0 | NaN | NaN | NaN | 3067.225666 | 1364.206665 | 80.0 | 2100.0 | 3000.0 | 4000.0 | 12000.0 |
| weight | 3564.0 | NaN | NaN | NaN | 179.424285 | 90.280856 | 23.0 | 140.0 | 159.0 | 184.0 | 950.0 |
| release_year | 3571.0 | NaN | NaN | NaN | 2015.964996 | 2.291784 | 2013.0 | 2014.0 | 2016.0 | 2018.0 | 2020.0 |
| days_used | 3571.0 | NaN | NaN | NaN | 675.391487 | 248.640972 | 91.0 | 536.0 | 690.0 | 872.0 | 1094.0 |
| new_price | 3571.0 | NaN | NaN | NaN | 237.389037 | 197.545581 | 9.13 | 120.13 | 189.8 | 291.935 | 2560.2 |
| used_price | 3571.0 | NaN | NaN | NaN | 109.880277 | 121.501226 | 2.51 | 45.205 | 75.53 | 126.0 | 1916.54 |
data.shape
(3571, 15)
# check the number of unique values in each column of the dataframe
data.nunique()
brand_name 34 os 4 screen_size 127 4g 2 5g 2 main_camera_mp 44 selfie_camera_mp 37 int_memory 16 ram 14 battery 354 weight 613 release_year 8 days_used 930 new_price 3099 used_price 3044 dtype: int64
data["brand_name"] = data["brand_name"].astype("category")
data["os"] = data["os"].astype("category")
data["4g"] = data["4g"].astype("category")
data["5g"] = data["5g"].astype("category")
data.dtypes
brand_name category os category screen_size float64 4g category 5g category main_camera_mp float64 selfie_camera_mp float64 int_memory float64 ram float64 battery float64 weight float64 release_year int64 days_used int64 new_price float64 used_price float64 dtype: object
# function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # for histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate of the annotation
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar
    plt.show()  # show the plot
Let's explore the dependent variable used_price
histogram_boxplot(data, "used_price")
Let's explore the independent variable new_price
histogram_boxplot(data, "new_price")
Let's explore the independent variable days_used
histogram_boxplot(data, "days_used")
Let's explore the independent variable weight
histogram_boxplot(data, "weight")
Let's explore the independent variable 5g
labeled_barplot(data, "5g", perc=True)
labeled_barplot(data, "4g", perc=True)
labeled_barplot(data, "brand_name", perc=True)
Let's look at correlations.
plt.figure(figsize=(15, 10))
sns.heatmap(data.corr(), annot=True, cmap="Spectral")
<AxesSubplot:>
Observations
Let's look at the graphs of a few variables that are highly correlated with used_price.
used price vs new price vs 5G status
plt.figure(figsize=(10, 8))
sns.scatterplot(y="new_price", x="used_price", hue="5g", data=data)
plt.show()
used price vs new price vs 4G status
plt.figure(figsize=(10, 8))
sns.scatterplot(y="new_price", x="used_price", hue="4g", data=data)
plt.show()
used price vs ram capacity
plt.figure(figsize=(10, 8))
sns.scatterplot(y="used_price", x="ram", data=data)
plt.show()
plt.figure(figsize=(10, 8))
sns.scatterplot(y="used_price", x="days_used", data=data)
plt.show()
used price vs selfie cam mp
plt.figure(figsize=(10, 8))
sns.scatterplot(y="used_price", x="selfie_camera_mp", data=data)
plt.show()
check the used price against release year
plt.figure(figsize=(15, 7))
sns.lineplot(x="release_year", y="used_price", data=data, ci=None)
plt.show()
histogram_boxplot(data, "used_price")
Observations
# Function to create barplots that indicate percentage for each category.
def perc_on_bar(plot, feature):
    """
    plot: matplotlib axes containing the barplot
    feature: categorical feature
    the function won't work if a column is passed in the hue parameter
    """
    total = len(feature)  # length of the column
    for p in plot.patches:
        percentage = "{:.1f}%".format(
            100 * p.get_height() / total
        )  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x-coordinate of the annotation
        y = p.get_y() + p.get_height()  # height of the bar
        plot.annotate(percentage, (x, y), size=12)  # annotate the percentage
    plt.show()  # show the plot
plt.figure(figsize=(15, 5))
ax = sns.countplot(data=data, x="os", palette="winter")
perc_on_bar(ax, data["os"])
plt.figure(figsize=(25, 7))
sns.boxplot(data=data, x="brand_name", y="ram", palette="PuBu")
plt.show()
sns.catplot(x="brand_name", y="ram", kind="box", data=data)
<seaborn.axisgrid.FacetGrid at 0x15655658>
large_battery = data[data["battery"] >= 4500]
large_battery.head()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Honor | Android | 26.04 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | 278.93 | 180.23 |
| 4 | Honor | Android | 15.72 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | 140.87 | 103.80 |
| 11 | Honor | Android | 15.72 | yes | no | 13.0 | 8.0 | 64.0 | 4.0 | 5000.0 | 185.0 | 2020 | 344 | 117.94 | 74.60 |
| 20 | Honor | Android | 25.56 | yes | no | 5.0 | 2.0 | 32.0 | 3.0 | 5100.0 | 173.0 | 2019 | 266 | 248.90 | 167.63 |
| 21 | Honor | Android | 20.32 | yes | no | 8.0 | 8.0 | 32.0 | 3.0 | 5100.0 | 173.0 | 2019 | 321 | 201.14 | 131.89 |
large_battery.describe()
| screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 411.000000 | 340.000000 | 411.000000 | 411.000000 | 411.000000 | 411.000000 | 411.000000 | 411.000000 | 411.000000 | 411.000000 | 411.000000 |
| mean | 21.351363 | 9.388382 | 8.993917 | 72.136253 | 4.630170 | 5665.205596 | 312.032847 | 2017.430657 | 528.922141 | 349.301088 | 191.062798 |
| std | 5.074957 | 4.772113 | 7.894384 | 96.857053 | 1.796871 | 1318.675616 | 155.422787 | 2.417807 | 278.615276 | 291.984058 | 203.719689 |
| min | 10.160000 | 0.300000 | 0.300000 | 16.000000 | 1.000000 | 4500.000000 | 23.000000 | 2013.000000 | 92.000000 | 80.820000 | 33.090000 |
| 25% | 16.350000 | 5.000000 | 2.000000 | 24.000000 | 4.000000 | 4850.000000 | 195.000000 | 2015.000000 | 288.500000 | 189.305000 | 85.860000 |
| 50% | 20.960000 | 8.000000 | 8.000000 | 32.000000 | 4.000000 | 5000.000000 | 220.000000 | 2018.000000 | 506.000000 | 269.850000 | 125.570000 |
| 75% | 25.560000 | 13.000000 | 16.000000 | 128.000000 | 4.000000 | 6185.000000 | 450.000000 | 2020.000000 | 737.000000 | 398.180000 | 200.085000 |
| max | 46.360000 | 48.000000 | 32.000000 | 1024.000000 | 12.000000 | 12000.000000 | 950.000000 | 2020.000000 | 1089.000000 | 2560.200000 | 1916.540000 |
histogram_boxplot(large_battery, "weight")
Let's look at large-screen phones, i.e., those with screens larger than 6 inches (15.24 cm).
large_screen = data[data["screen_size"] > 15.24]
large_screen.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1235 entries, 0 to 3569 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand_name 1235 non-null category 1 os 1235 non-null category 2 screen_size 1235 non-null float64 3 4g 1235 non-null category 4 5g 1235 non-null category 5 main_camera_mp 1062 non-null float64 6 selfie_camera_mp 1234 non-null float64 7 int_memory 1235 non-null float64 8 ram 1235 non-null float64 9 battery 1233 non-null float64 10 weight 1235 non-null float64 11 release_year 1235 non-null int64 12 days_used 1235 non-null int64 13 new_price 1235 non-null float64 14 used_price 1235 non-null float64 dtypes: category(4), float64(9), int64(2) memory usage: 121.5 KB
large_screen.groupby("brand_name").count().sort_values(["os"], ascending=False)
| os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | ||||||||||||||
| Huawei | 159 | 159 | 159 | 159 | 159 | 159 | 159 | 159 | 159 | 159 | 159 | 159 | 159 | 159 |
| Samsung | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
| Others | 116 | 116 | 116 | 116 | 116 | 116 | 116 | 116 | 116 | 116 | 116 | 116 | 116 | 116 |
| Honor | 87 | 87 | 87 | 87 | 87 | 87 | 87 | 87 | 87 | 87 | 87 | 87 | 87 | 87 |
| Vivo | 86 | 86 | 86 | 86 | 72 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 | 86 |
| Xiaomi | 85 | 85 | 85 | 85 | 62 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 |
| Lenovo | 72 | 72 | 72 | 72 | 72 | 72 | 72 | 72 | 72 | 72 | 72 | 72 | 72 | 72 |
| Oppo | 70 | 70 | 70 | 70 | 50 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 | 70 |
| LG | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 68 |
| Asus | 44 | 44 | 44 | 44 | 40 | 44 | 44 | 44 | 44 | 44 | 44 | 44 | 44 | 44 |
| Motorola | 44 | 44 | 44 | 44 | 26 | 44 | 44 | 44 | 44 | 44 | 44 | 44 | 44 | 44 |
| Realme | 40 | 40 | 40 | 40 | 4 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 |
| Nokia | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 |
| Alcatel | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 28 |
| Meizu | 25 | 25 | 25 | 25 | 10 | 25 | 25 | 25 | 24 | 25 | 25 | 25 | 25 | 25 |
| Apple | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 |
| ZTE | 22 | 22 | 22 | 22 | 18 | 22 | 22 | 22 | 22 | 22 | 22 | 22 | 22 | 22 |
| Acer | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 |
| OnePlus | 16 | 16 | 16 | 16 | 0 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
| Sony | 14 | 14 | 14 | 14 | 10 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14 | 14 |
| Micromax | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
| Infinix | 10 | 10 | 10 | 10 | 0 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| HTC | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| Gionee | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| XOLO | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| 4 | 4 | 4 | 4 | 4 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | |
| Coolpad | 3 | 3 | 3 | 3 | 0 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| Celkon | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| BlackBerry | 2 | 2 | 2 | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Lava | 2 | 2 | 2 | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Spice | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Karbonn | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Panasonic | 2 | 2 | 2 | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Microsoft | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
# visualize brand shares within large screen phone category
plt.figure(figsize=(25, 10))
ax = sns.countplot(data=large_screen, x="brand_name", palette="winter")
perc_on_bar(ax, large_screen["brand_name"])
budget_phone = data.loc[(data["new_price"] < 300) & (data["selfie_camera_mp"] > 8)]
budget_phone.describe()
| screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 338.000000 | 338.000000 | 338.000000 | 338.000000 | 338.0 | 338.000000 | 338.000000 | 338.000000 | 338.000000 | 338.000000 | 338.000000 |
| mean | 16.939172 | 11.516568 | 15.581361 | 75.668639 | 4.0 | 3736.464497 | 173.905325 | 2018.618343 | 426.002959 | 216.810675 | 125.084970 |
| std | 3.607834 | 3.553753 | 1.555132 | 39.897602 | 0.0 | 924.955208 | 25.925324 | 1.205534 | 205.467745 | 51.860596 | 41.482757 |
| min | 6.985000 | 2.000000 | 9.000000 | 4.000000 | 4.0 | 230.000000 | 74.000000 | 2014.000000 | 91.000000 | 99.700000 | 35.740000 |
| 25% | 15.240000 | 8.000000 | 16.000000 | 64.000000 | 4.0 | 3260.000000 | 163.000000 | 2018.000000 | 269.250000 | 170.890000 | 94.932500 |
| 50% | 15.880000 | 13.000000 | 16.000000 | 64.000000 | 4.0 | 4000.000000 | 175.500000 | 2019.000000 | 388.000000 | 219.385000 | 121.780000 |
| 75% | 19.960000 | 13.000000 | 16.000000 | 128.000000 | 4.0 | 4035.000000 | 191.000000 | 2019.000000 | 560.000000 | 259.665000 | 151.142500 |
| max | 22.225000 | 25.000000 | 17.000000 | 136.000000 | 4.0 | 6000.000000 | 250.000000 | 2020.000000 | 1091.000000 | 299.820000 | 223.970000 |
plt.figure(figsize=(25, 15))
ax = sns.countplot(data=budget_phone, x="brand_name", palette="winter")
perc_on_bar(ax, budget_phone["brand_name"])
# correlation of all attributes with used phone price
data.corr()["used_price"]
screen_size 0.508230 main_camera_mp 0.403944 selfie_camera_mp 0.641331 int_memory 0.592726 ram NaN battery 0.485956 weight 0.358080 release_year 0.539031 days_used -0.553117 new_price 0.907948 used_price 1.000000 Name: used_price, dtype: float64
plt.figure(figsize=(10, 7))
sns.scatterplot(y="used_price", x="new_price", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.scatterplot(y="used_price", x="days_used", data=data)
plt.show()
# recheck the pattern of missing values
data.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 180 selfie_camera_mp 2 int_memory 10 ram 10 battery 6 weight 7 release_year 0 days_used 0 new_price 0 used_price 0 dtype: int64
# looking at which columns have the most missing values
data.isnull().sum().sort_values(ascending=False)
main_camera_mp 180 int_memory 10 ram 10 weight 7 battery 6 selfie_camera_mp 2 brand_name 0 os 0 screen_size 0 4g 0 5g 0 release_year 0 days_used 0 new_price 0 used_price 0 dtype: int64
Let's fix the missing values.
medianFiller = lambda x: x.fillna(x.median())
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
data[numeric_columns] = data[numeric_columns].apply(medianFiller, axis=0)
# checking the number of missing values
data.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 0 int_memory 0 ram 0 battery 0 weight 0 release_year 0 days_used 0 new_price 0 used_price 0 dtype: int64
Now we don't have any missing values.
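The column-wise median fill used above can be sanity-checked on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})
# fill each column's missing entries with that column's own median
toy_filled = toy.apply(lambda x: x.fillna(x.median()), axis=0)
# the NaN in "a" becomes 2.0 (median of 1 and 3); the NaN in "b" becomes 15.0
```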
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3571 entries, 0 to 3570 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand_name 3571 non-null category 1 os 3571 non-null category 2 screen_size 3571 non-null float64 3 4g 3571 non-null category 4 5g 3571 non-null category 5 main_camera_mp 3571 non-null float64 6 selfie_camera_mp 3571 non-null float64 7 int_memory 3571 non-null float64 8 ram 3571 non-null float64 9 battery 3571 non-null float64 10 weight 3571 non-null float64 11 release_year 3571 non-null int64 12 days_used 3571 non-null int64 13 new_price 3571 non-null float64 14 used_price 3571 non-null float64 dtypes: category(4), float64(9), int64(2) memory usage: 321.8 KB
# define numeric_columns
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# let's plot the boxplots of all columns to check for outliers
plt.figure(figsize=(20, 25))
for i, variable in enumerate(numeric_columns):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
There are no lower outliers in any variable.
There are no outliers in days_used or release_year.
There are high outliers in screen_size, main_camera_mp, selfie_camera_mp, int_memory, battery, weight, new_price, and used_price.
We will treat these outliers, as they might adversely affect the predictive power of a linear model.
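The standard IQR rule caps values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]; a minimal sketch on a made-up series shows the effect:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is an obvious high outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)  # 2.0 and 4.0 here
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker bounds: -1.0 and 7.0
s_clipped = s.clip(lower, upper)  # the 100 is capped at the upper whisker, 7.0
```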
def treat_outliers(df, col):
    """
    Treats outliers in a variable
    df: dataframe
    col: name of the numerical column
    """
    Q1 = df[col].quantile(0.25)  # 25th quantile
    Q3 = df[col].quantile(0.75)  # 75th quantile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
    # values greater than Upper_Whisker will be assigned the value of Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df
def treat_outliers_all(df, col_list):
    """
    Treat outliers in all numerical variables
    df: dataframe
    col_list: list of numerical columns
    """
    for c in col_list:
        df = treat_outliers(df, c)
    return df
# treating the outliers
numerical_col = data.select_dtypes(include=np.number).columns.tolist()
data = treat_outliers_all(data, numerical_col)
# let's look at the boxplots to see if the outliers have been treated or not
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numeric_columns):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
histogram_boxplot(data, "used_price")
histogram_boxplot(data, "new_price")
histogram_boxplot(data, "days_used")
histogram_boxplot(data, "weight")
labeled_barplot(data, "5g", perc=True)
labeled_barplot(data, "4g", perc=True)
labeled_barplot(data, "brand_name", perc=True)
Let's look at the correlations again after outlier treatment.
plt.figure(figsize=(15, 10))
sns.heatmap(data.corr(), annot=True, cmap="Spectral")
<AxesSubplot:>
Observation
Using a pairplot to visualize pairwise relationships among the numeric variables.
sns.pairplot(data)
<seaborn.axisgrid.PairGrid at 0x7992a30>
plt.figure(figsize=(10, 8))
sns.scatterplot(y="new_price", x="used_price", hue="5g", data=data)
plt.show()
plt.figure(figsize=(10, 8))
sns.scatterplot(y="new_price", x="used_price", hue="4g", data=data)
plt.show()
brand name vs used price
plt.figure(figsize=(17, 9))
sns.boxplot(y="brand_name", x="used_price", data=data)
plt.show()
plt.figure(figsize=(17, 9))
sns.boxplot(y="screen_size", x="used_price", data=data)
plt.show()
check used price vs 5g vs release year
plt.figure(figsize=(20, 10))
sns.scatterplot(y="release_year", x="used_price", hue="5g", data=data)
plt.show()
check used price vs 4g vs release year
plt.figure(figsize=(20, 10))
sns.scatterplot(y="release_year", x="used_price", hue="4g", data=data)
plt.show()
check used price vs year
plt.figure(figsize=(15, 7))
sns.lineplot(x="release_year", y="used_price", data=data, ci=None)
plt.show()
plt.figure(figsize=(10, 8))
sns.scatterplot(y="used_price", x="selfie_camera_mp", data=data)
plt.show()
plt.figure(figsize=(10, 8))
sns.scatterplot(y="used_price", x="ram", data=data)
plt.show()
plt.figure(figsize=(10, 8))
sns.scatterplot(y="used_price", x="days_used", data=data)
plt.show()
We want to predict the used phone price.
Before we proceed to build a model, we'll have to encode categorical features.
We'll split the data into train and test to be able to evaluate the model that we build on the train data.
We will build a Linear Regression model using the train data and then check its performance.
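One-hot encoding with `pd.get_dummies(..., drop_first=True)` can be illustrated on a toy frame (made-up values mirroring the dataset's `os` and `ram` columns):

```python
import pandas as pd

toy = pd.DataFrame({"os": ["Android", "iOS", "Android"], "ram": [4, 6, 4]})
# drop_first=True drops one dummy level per feature (here "Android" becomes
# the baseline), avoiding perfect multicollinearity in the regression
toy_encoded = pd.get_dummies(toy, columns=["os"], drop_first=True)
# toy_encoded keeps "ram" and gains a single indicator column "os_iOS"
```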
# defining X and y variables
X = data.drop("used_price", axis=1)
y = data["used_price"]
print(X.head())
print(y.head())
brand_name os screen_size 4g 5g main_camera_mp \ 0 Honor Android 22.225 yes no 13.0 1 Honor Android 22.225 yes yes 13.0 2 Honor Android 22.225 yes yes 13.0 3 Honor Android 22.225 yes yes 13.0 4 Honor Android 15.720 yes no 13.0 selfie_camera_mp int_memory ram battery weight release_year \ 0 5.0 64.0 4.0 3020.0 146.0 2020 1 16.0 128.0 4.0 4300.0 213.0 2020 2 8.0 128.0 4.0 4200.0 213.0 2020 3 8.0 64.0 4.0 6850.0 250.0 2020 4 8.0 64.0 4.0 5000.0 185.0 2020 days_used new_price 0 127 111.62 1 325 249.39 2 162 359.47 3 345 278.93 4 293 140.87 0 86.9600 1 161.4900 2 247.1925 3 180.2300 4 103.8000 Name: used_price, dtype: float64
data.describe()
| screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3571.000000 | 3571.000000 | 3571.000000 | 3571.000000 | 3571.0 | 3571.000000 | 3571.000000 | 3571.000000 | 3571.000000 | 3571.000000 | 3571.000000 |
| mean | 14.521803 | 9.299619 | 5.972417 | 44.583202 | 4.0 | 3042.193083 | 164.415584 | 2015.964996 | 675.391487 | 221.841506 | 95.528957 |
| std | 4.165771 | 4.530650 | 5.287271 | 38.938853 | 0.0 | 1272.573404 | 41.098253 | 2.291784 | 248.640972 | 135.411699 | 66.145611 |
| min | 6.985000 | 0.080000 | 0.300000 | 0.005000 | 4.0 | 80.000000 | 74.000000 | 2013.000000 | 91.000000 | 9.130000 | 2.510000 |
| 25% | 12.700000 | 5.000000 | 2.000000 | 16.000000 | 4.0 | 2100.000000 | 140.000000 | 2014.000000 | 536.000000 | 120.130000 | 45.205000 |
| 50% | 13.490000 | 8.000000 | 5.000000 | 32.000000 | 4.0 | 3000.000000 | 159.000000 | 2016.000000 | 690.000000 | 189.800000 | 75.530000 |
| 75% | 16.510000 | 13.000000 | 8.000000 | 64.000000 | 4.0 | 4000.000000 | 184.000000 | 2018.000000 | 872.000000 | 291.935000 | 126.000000 |
| max | 22.225000 | 25.000000 | 17.000000 | 136.000000 | 4.0 | 6850.000000 | 250.000000 | 2020.000000 | 1094.000000 | 549.642500 | 247.192500 |
# encode the categorical variables via one-hot encoding
X = pd.get_dummies(
    X,
    columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,
)
X.head()
| screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | ... | brand_name_Spice | brand_name_Vivo | brand_name_XOLO | brand_name_Xiaomi | brand_name_ZTE | os_Others | os_Windows | os_iOS | 4g_yes | 5g_yes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.225 | 13.0 | 5.0 | 64.0 | 4.0 | 3020.0 | 146.0 | 2020 | 127 | 111.62 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 22.225 | 13.0 | 16.0 | 128.0 | 4.0 | 4300.0 | 213.0 | 2020 | 325 | 249.39 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 22.225 | 13.0 | 8.0 | 128.0 | 4.0 | 4200.0 | 213.0 | 2020 | 162 | 359.47 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 22.225 | 13.0 | 8.0 | 64.0 | 4.0 | 6850.0 | 250.0 | 2020 | 345 | 278.93 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 15.720 | 13.0 | 8.0 | 64.0 | 4.0 | 5000.0 | 185.0 | 2020 | 293 | 140.87 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 48 columns
# splitting the data in 70:30 ratio for train to test data
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 2499 Number of rows in test data = 1072
# fitting the linear regression model on the train data (70% of the whole data)
linearregression = LinearRegression()
linearregression.fit(x_train, y_train)
LinearRegression()
Let's check the coefficients and intercept of the model.
coef_df = pd.DataFrame(
    np.append(linearregression.coef_, linearregression.intercept_),
    index=x_train.columns.tolist() + ["Intercept"],
    columns=["Coefficients"],
)
coef_df
| Coefficients | |
|---|---|
| screen_size | 2.076879e-01 |
| main_camera_mp | -2.742636e-01 |
| selfie_camera_mp | 8.380688e-01 |
| int_memory | 8.588868e-02 |
| ram | -1.372069e-11 |
| battery | 1.355481e-04 |
| weight | -1.226033e-02 |
| release_year | -2.459011e-01 |
| days_used | -8.460932e-02 |
| new_price | 3.844636e-01 |
| brand_name_Alcatel | 1.804391e-01 |
| brand_name_Apple | 2.324314e+01 |
| brand_name_Asus | 5.894185e-01 |
| brand_name_BlackBerry | 6.697251e+00 |
| brand_name_Celkon | -7.164660e-01 |
| brand_name_Coolpad | 6.615371e+00 |
| brand_name_Gionee | -5.287759e+00 |
| brand_name_Google | 1.222934e+01 |
| brand_name_HTC | -6.395954e-01 |
| brand_name_Honor | 2.071706e+00 |
| brand_name_Huawei | 9.721871e-01 |
| brand_name_Infinix | -1.490254e+01 |
| brand_name_Karbonn | -1.357333e+00 |
| brand_name_LG | 1.744767e+00 |
| brand_name_Lava | -1.321284e+00 |
| brand_name_Lenovo | -1.720285e+00 |
| brand_name_Meizu | -6.605491e-01 |
| brand_name_Micromax | 1.198670e+00 |
| brand_name_Microsoft | -3.479350e+00 |
| brand_name_Motorola | -7.728710e-01 |
| brand_name_Nokia | -7.482801e+00 |
| brand_name_OnePlus | -1.687551e+01 |
| brand_name_Oppo | -2.627715e+00 |
| brand_name_Others | 4.194104e-01 |
| brand_name_Panasonic | 1.331692e+00 |
| brand_name_Realme | -2.678709e+00 |
| brand_name_Samsung | 5.777150e-01 |
| brand_name_Sony | 1.032435e+00 |
| brand_name_Spice | 1.771798e+00 |
| brand_name_Vivo | 9.158406e-01 |
| brand_name_XOLO | 2.836924e+00 |
| brand_name_Xiaomi | -1.318259e+00 |
| brand_name_ZTE | -5.036687e-01 |
| os_Others | -4.003459e+00 |
| os_Windows | 2.528868e+00 |
| os_iOS | -1.580834e+01 |
| 4g_yes | -2.884339e+00 |
| 5g_yes | 2.122906e+00 |
| Intercept | 5.574842e+02 |
Let's make predictions on the test set (x_test) with the model and compare the actual values with the predicted ones.
# predictions on the test set
pred = linearregression.predict(x_test)
df_pred_test = pd.DataFrame({"Actual": y_test, "Predicted": pred})
df_pred_test.head(10)
| Actual | Predicted | |
|---|---|---|
| 457 | 38.7700 | 24.193634 |
| 1647 | 120.2900 | 115.036152 |
| 351 | 247.1925 | 239.689713 |
| 1667 | 80.1400 | 79.573036 |
| 1849 | 60.1800 | 68.242898 |
| 2111 | 74.5600 | 77.445702 |
| 1366 | 39.2700 | 22.932912 |
| 70 | 79.1400 | 82.206502 |
| 1206 | 80.3300 | 90.870223 |
| 1213 | 63.0200 | 81.968996 |
We will be using the metric functions RMSE, MAE, and R² from sklearn.
We will define our own functions to calculate MAPE and adjusted R².
The mean absolute percentage error (MAPE) measures the accuracy of predictions as a percentage: it is the average of the absolute errors (predicted minus actual) divided by the actual values, multiplied by 100. It works best when there are no extreme values in the data and none of the actual values are 0. We will also create a function that prints all the above metrics in one go.
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# function to compute MAPE
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
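As a quick sanity check of the two formulas above (toy numbers, not from the phone dataset):

```python
import numpy as np

# Toy check of MAPE: errors are 10/100, 20/200, 5/50 -> each 10%
actual = np.array([100.0, 200.0, 50.0])
pred = np.array([110.0, 180.0, 55.0])
mape = np.mean(np.abs(actual - pred) / actual) * 100
print(mape)  # 10.0

# Toy check of adjusted R-squared: it shrinks R-squared as the
# number of predictors k grows for a fixed sample size n
r2, n, k = 0.90, 100, 10
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))  # 0.8888
```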
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mape_score(target, pred) # to compute MAPE
# creating a dataframe of metrics
data1_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return data1_perf
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
linearregression_train_perf = model_performance_regression(
linearregression, x_train, y_train
)
linearregression_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 13.960441 | 10.222224 | 0.955136 | 0.954257 | 18.489055 |
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
linearregression_test_perf = model_performance_regression(
linearregression, x_test, y_test
)
linearregression_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 13.74532 | 10.171443 | 0.957443 | 0.955446 | 16.417574 |
Observations
import statsmodels.api as sm
# Add constant to train data
x_train1 = sm.add_constant(x_train)
# Add constant to test data
x_test1 = sm.add_constant(x_test)
# Create model on train data
olsmod0 = sm.OLS(y_train, x_train1).fit()
print(olsmod0.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price R-squared: 0.955
Model: OLS Adj. R-squared: 0.954
Method: Least Squares F-statistic: 1110.
Date: Tue, 17 Aug 2021 Prob (F-statistic): 0.00
Time: 13:12:22 Log-Likelihood: -10134.
No. Observations: 2499 AIC: 2.036e+04
Df Residuals: 2451 BIC: 2.064e+04
Df Model: 47
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
screen_size 0.2077 0.127 1.633 0.103 -0.042 0.457
main_camera_mp -0.2743 0.092 -2.984 0.003 -0.455 -0.094
selfie_camera_mp 0.8381 0.105 8.008 0.000 0.633 1.043
int_memory 0.0859 0.011 7.998 0.000 0.065 0.107
ram 139.3711 138.265 1.008 0.314 -131.758 410.500
battery 0.0001 0.000 0.313 0.754 -0.001 0.001
weight -0.0123 0.012 -0.989 0.323 -0.037 0.012
release_year -0.2459 0.274 -0.896 0.370 -0.784 0.292
days_used -0.0846 0.002 -46.373 0.000 -0.088 -0.081
new_price 0.3845 0.003 111.668 0.000 0.378 0.391
brand_name_Alcatel 0.1804 2.956 0.061 0.951 -5.616 5.977
brand_name_Apple 23.2431 10.523 2.209 0.027 2.609 43.878
brand_name_Asus 0.5894 2.945 0.200 0.841 -5.185 6.364
brand_name_BlackBerry 6.6973 4.641 1.443 0.149 -2.404 15.798
brand_name_Celkon -0.7165 3.855 -0.186 0.853 -8.277 6.844
brand_name_Coolpad 6.6154 4.571 1.447 0.148 -2.347 15.578
brand_name_Gionee -5.2878 3.405 -1.553 0.121 -11.964 1.389
brand_name_Google 12.2293 4.860 2.516 0.012 2.699 21.760
brand_name_HTC -0.6396 3.000 -0.213 0.831 -6.523 5.244
brand_name_Honor 2.0717 3.021 0.686 0.493 -3.852 7.995
brand_name_Huawei 0.9722 2.762 0.352 0.725 -4.444 6.388
brand_name_Infinix -14.9025 6.366 -2.341 0.019 -27.385 -2.420
brand_name_Karbonn -1.3573 3.880 -0.350 0.726 -8.965 6.251
brand_name_LG 1.7448 2.778 0.628 0.530 -3.703 7.193
brand_name_Lava -1.3213 3.711 -0.356 0.722 -8.598 5.956
brand_name_Lenovo -1.7203 2.849 -0.604 0.546 -7.308 3.867
brand_name_Meizu -0.6605 3.363 -0.196 0.844 -7.256 5.935
brand_name_Micromax 1.1987 2.946 0.407 0.684 -4.578 6.976
brand_name_Microsoft -3.4793 4.950 -0.703 0.482 -13.186 6.227
brand_name_Motorola -0.7729 2.980 -0.259 0.795 -6.616 5.070
brand_name_Nokia -7.4828 3.025 -2.474 0.013 -13.415 -1.551
brand_name_OnePlus -16.8755 4.373 -3.859 0.000 -25.450 -8.301
brand_name_Oppo -2.6277 2.966 -0.886 0.376 -8.444 3.188
brand_name_Others 0.4194 2.612 0.161 0.872 -4.702 5.541
brand_name_Panasonic 1.3317 3.483 0.382 0.702 -5.499 8.162
brand_name_Realme -2.6787 3.859 -0.694 0.488 -10.246 4.889
brand_name_Samsung 0.5777 2.682 0.215 0.829 -4.681 5.837
brand_name_Sony 1.0324 3.164 0.326 0.744 -5.171 7.236
brand_name_Spice 1.7718 4.044 0.438 0.661 -6.158 9.702
brand_name_Vivo 0.9158 3.072 0.298 0.766 -5.108 6.940
brand_name_XOLO 2.8369 3.443 0.824 0.410 -3.915 9.589
brand_name_Xiaomi -1.3183 2.952 -0.447 0.655 -7.107 4.471
brand_name_ZTE -0.5037 2.978 -0.169 0.866 -6.344 5.336
os_Others -4.0035 1.507 -2.657 0.008 -6.958 -1.049
os_Windows 2.5289 2.718 0.930 0.352 -2.801 7.859
os_iOS -15.8083 10.454 -1.512 0.131 -36.308 4.691
4g_yes -2.8843 0.914 -3.155 0.002 -4.677 -1.092
5g_yes 2.1229 1.723 1.232 0.218 -1.256 5.502
==============================================================================
Omnibus: 288.376 Durbin-Watson: 1.983
Prob(Omnibus): 0.000 Jarque-Bera (JB): 655.279
Skew: 0.685 Prob(JB): 5.10e-143
Kurtosis: 5.102 Cond. No. 1.88e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.88e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Observations
A negative coefficient indicates that used_price decreases as the corresponding attribute increases, and a positive coefficient indicates that it increases.
The p-value of a variable indicates whether that variable is statistically significant. At a significance level of 0.05 (5%), any variable with a p-value less than 0.05 is considered significant.
However, we first need to deal with multicollinearity and check the other assumptions of linear regression before drawing conclusions from the p-values.
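The significance screening described above can be sketched as follows. The p-values here are hypothetical, not the model's actual output; a fitted statsmodels result exposes the same structure via `results.pvalues`:

```python
import pandas as pd

# hypothetical p-values, shaped like statsmodels' results.pvalues Series
pvalues = pd.Series(
    {"days_used": 0.000, "new_price": 0.000, "battery": 0.754, "5g_yes": 0.218}
)

# keep only the predictors significant at the 5% level
significant = pvalues[pvalues < 0.05].index.tolist()
print(significant)  # ['days_used', 'new_price']
```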
We will be checking the following Linear Regression assumptions:
No Multicollinearity
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
Multicollinearity occurs when predictor variables in a regression model are correlated. This correlation is a problem because predictor variables should be independent. If the correlation between variables is high, it can cause problems when we fit the model and interpret the results. When we have multicollinearity in the linear model, the coefficients that the model suggests are unreliable.
There are different ways of detecting (or testing) multicollinearity. One such way is by using the Variance Inflation Factor, or VIF.
Variance Inflation Factor (VIF): Variance inflation factors measure the inflation in the variances of the regression parameter estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient $\beta_k$ is "inflated" by the existence of correlation among the predictor variables in the model.
General Rule of thumb:
- If VIF is between 1 and 5, then there is low multicollinearity.
- If VIF is between 5 and 10, we say there is moderate multicollinearity.
- If VIF is exceeding 10, it shows signs of high multicollinearity.
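To build intuition for why correlated predictors inflate VIF, here is a minimal numpy-only sketch on synthetic data. The `vif` helper below reimplements the definition VIF = 1/(1 - R²) and is not part of the project code:

```python
import numpy as np

def vif(X, k):
    """VIF of column k: regress X[:, k] on the other columns
    (with an intercept) and return 1 / (1 - R^2)."""
    y = X[:, k]
    others = np.delete(X, k, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly collinear with x1
x3 = rng.normal(size=500)                  # independent of both
X = np.column_stack([x1, x2, x3])
# x1 and x2 should show high VIF, while x3 stays close to 1
print([round(vif(X, k), 1) for k in range(3)])
```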
from statsmodels.stats.outliers_influence import variance_inflation_factor
# we will define a function to check VIF
def checking_vif(predictors):
vif = pd.DataFrame()
vif["feature"] = predictors.columns
# calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(predictors.values, i)
for i in range(len(predictors.columns))
]
return vif
checking_vif(x_train1)
| feature | VIF | |
|---|---|---|
| 0 | screen_size | 3.529005e+00 |
| 1 | main_camera_mp | 2.227816e+00 |
| 2 | selfie_camera_mp | 3.860192e+00 |
| 3 | int_memory | 2.187474e+00 |
| 4 | ram | 3.846732e+06 |
| 5 | battery | 3.730602e+00 |
| 6 | weight | 3.202294e+00 |
| 7 | release_year | 5.000190e+00 |
| 8 | days_used | 2.631834e+00 |
| 9 | new_price | 2.714314e+00 |
| 10 | brand_name_Alcatel | 3.487238e+00 |
| 11 | brand_name_Apple | 2.139294e+01 |
| 12 | brand_name_Asus | 3.623806e+00 |
| 13 | brand_name_BlackBerry | 1.509076e+00 |
| 14 | brand_name_Celkon | 1.997885e+00 |
| 15 | brand_name_Coolpad | 1.463510e+00 |
| 16 | brand_name_Gionee | 2.183094e+00 |
| 17 | brand_name_Google | 1.419538e+00 |
| 18 | brand_name_HTC | 3.423093e+00 |
| 19 | brand_name_Honor | 3.727736e+00 |
| 20 | brand_name_Huawei | 6.478001e+00 |
| 21 | brand_name_Infinix | 1.220558e+00 |
| 22 | brand_name_Karbonn | 1.726326e+00 |
| 23 | brand_name_LG | 5.612996e+00 |
| 24 | brand_name_Lava | 1.850924e+00 |
| 25 | brand_name_Lenovo | 4.407972e+00 |
| 26 | brand_name_Meizu | 2.405741e+00 |
| 27 | brand_name_Micromax | 3.545658e+00 |
| 28 | brand_name_Microsoft | 2.081773e+00 |
| 29 | brand_name_Motorola | 3.834844e+00 |
| 30 | brand_name_Nokia | 3.952874e+00 |
| 31 | brand_name_OnePlus | 1.624763e+00 |
| 32 | brand_name_Oppo | 4.086727e+00 |
| 33 | brand_name_Others | 1.055346e+01 |
| 34 | brand_name_Panasonic | 2.107166e+00 |
| 35 | brand_name_Realme | 1.854854e+00 |
| 36 | brand_name_Samsung | 8.460536e+00 |
| 37 | brand_name_Sony | 2.901733e+00 |
| 38 | brand_name_Spice | 1.632728e+00 |
| 39 | brand_name_Vivo | 3.499731e+00 |
| 40 | brand_name_XOLO | 2.117069e+00 |
| 41 | brand_name_Xiaomi | 4.169762e+00 |
| 42 | brand_name_ZTE | 3.581848e+00 |
| 43 | os_Others | 1.631329e+00 |
| 44 | os_Windows | 1.750184e+00 |
| 45 | os_iOS | 2.004830e+01 |
| 46 | 4g_yes | 2.375675e+00 |
| 47 | 5g_yes | 1.475605e+00 |
To remove multicollinearity, we can drop one of the highly correlated features at a time, refit the model, and compare performance (adjusted R-squared and RMSE) to decide which feature to remove.
# Define a function to do this
def treating_multicollinearity(predictors, target, high_vif_columns):
"""
Checking the effect of dropping the columns showing high multicollinearity
on model performance (adj. R-squared and RMSE)
predictors: independent variables
target: dependent variable
high_vif_columns: columns having high VIF
"""
# empty lists to store adj. R-squared and RMSE values
adj_r2 = []
rmse = []
# build ols models by dropping one of the high VIF columns at a time
# store the adjusted R-squared and RMSE in the lists defined previously
for cols in high_vif_columns:
# defining the new train set
train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]
# create the model
olsmodel = sm.OLS(target, train).fit()
# adding adj. R-squared and RMSE to the lists
adj_r2.append(olsmodel.rsquared_adj)
rmse.append(np.sqrt(olsmodel.mse_resid))
# creating a dataframe for the results
temp = pd.DataFrame(
{
"col": high_vif_columns,
"Adj. R-squared after_dropping col": adj_r2,
"RMSE after dropping col": rmse,
}
).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
temp.reset_index(drop=True, inplace=True)
return temp
col_list = [
"brand_name_Samsung",
"brand_name_Others",
"brand_name_Huawei",
]
res = treating_multicollinearity(x_train1, y_train, col_list)
res
| col | Adj. R-squared after_dropping col | RMSE after dropping col | |
|---|---|---|---|
| 0 | brand_name_Others | 0.954293 | 14.093678 |
| 1 | brand_name_Samsung | 0.954293 | 14.093737 |
| 2 | brand_name_Huawei | 0.954292 | 14.093960 |
col_to_drop = "brand_name_Huawei"
x_train2 = x_train1.loc[:, ~x_train1.columns.str.startswith(col_to_drop)]
x_test2 = x_test1.loc[:, ~x_test1.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train2)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping brand_name_Huawei
| feature | VIF | |
|---|---|---|
| 0 | screen_size | 3.528662e+00 |
| 1 | main_camera_mp | 2.226834e+00 |
| 2 | selfie_camera_mp | 3.854456e+00 |
| 3 | int_memory | 2.182376e+00 |
| 4 | ram | 3.846727e+06 |
| 5 | battery | 3.729410e+00 |
| 6 | weight | 3.200223e+00 |
| 7 | release_year | 5.000061e+00 |
| 8 | days_used | 2.631833e+00 |
| 9 | new_price | 2.713847e+00 |
| 10 | brand_name_Alcatel | 1.408481e+00 |
| 11 | brand_name_Apple | 2.027214e+01 |
| 12 | brand_name_Asus | 1.397583e+00 |
| 13 | brand_name_BlackBerry | 1.121888e+00 |
| 14 | brand_name_Celkon | 1.297064e+00 |
| 15 | brand_name_Coolpad | 1.075584e+00 |
| 16 | brand_name_Gionee | 1.184003e+00 |
| 17 | brand_name_Google | 1.085809e+00 |
| 18 | brand_name_HTC | 1.381732e+00 |
| 19 | brand_name_Honor | 1.399717e+00 |
| 20 | brand_name_Infinix | 1.049977e+00 |
| 21 | brand_name_Karbonn | 1.148931e+00 |
| 22 | brand_name_LG | 1.688529e+00 |
| 23 | brand_name_Lava | 1.158927e+00 |
| 24 | brand_name_Lenovo | 1.511128e+00 |
| 25 | brand_name_Meizu | 1.201736e+00 |
| 26 | brand_name_Micromax | 1.450894e+00 |
| 27 | brand_name_Microsoft | 1.681880e+00 |
| 28 | brand_name_Motorola | 1.432411e+00 |
| 29 | brand_name_Nokia | 1.668399e+00 |
| 30 | brand_name_OnePlus | 1.128346e+00 |
| 31 | brand_name_Oppo | 1.461185e+00 |
| 32 | brand_name_Others | 2.459057e+00 |
| 33 | brand_name_Panasonic | 1.178409e+00 |
| 34 | brand_name_Realme | 1.141028e+00 |
| 35 | brand_name_Samsung | 2.110912e+00 |
| 36 | brand_name_Sony | 1.325149e+00 |
| 37 | brand_name_Spice | 1.133585e+00 |
| 38 | brand_name_Vivo | 1.362363e+00 |
| 39 | brand_name_XOLO | 1.217775e+00 |
| 40 | brand_name_Xiaomi | 1.447655e+00 |
| 41 | brand_name_ZTE | 1.377234e+00 |
| 42 | os_Others | 1.630131e+00 |
| 43 | os_Windows | 1.747353e+00 |
| 44 | os_iOS | 2.004682e+01 |
| 45 | 4g_yes | 2.360472e+00 |
| 46 | 5g_yes | 1.475590e+00 |
olsmod1 = sm.OLS(y_train, x_train2).fit()
print(olsmod1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price R-squared: 0.955
Model: OLS Adj. R-squared: 0.954
Method: Least Squares F-statistic: 1135.
Date: Tue, 17 Aug 2021 Prob (F-statistic): 0.00
Time: 13:15:29 Log-Likelihood: -10134.
No. Observations: 2499 AIC: 2.036e+04
Df Residuals: 2452 BIC: 2.064e+04
Df Model: 46
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
screen_size 0.2081 0.127 1.637 0.102 -0.041 0.457
main_camera_mp -0.2736 0.092 -2.977 0.003 -0.454 -0.093
selfie_camera_mp 0.8395 0.105 8.029 0.000 0.634 1.045
int_memory 0.0861 0.011 8.026 0.000 0.065 0.107
ram 139.3195 138.241 1.008 0.314 -131.761 410.400
battery 0.0001 0.000 0.319 0.750 -0.001 0.001
weight -0.0124 0.012 -0.998 0.318 -0.037 0.012
release_year -0.2454 0.274 -0.895 0.371 -0.783 0.292
days_used -0.0846 0.002 -46.381 0.000 -0.088 -0.081
new_price 0.3844 0.003 111.693 0.000 0.378 0.391
brand_name_Alcatel -0.6229 1.878 -0.332 0.740 -4.306 3.060
brand_name_Apple 22.3953 10.242 2.187 0.029 2.312 42.478
brand_name_Asus -0.2230 1.828 -0.122 0.903 -3.808 3.362
brand_name_BlackBerry 5.8698 4.001 1.467 0.142 -1.976 13.715
brand_name_Celkon -1.5202 3.106 -0.489 0.625 -7.611 4.570
brand_name_Coolpad 5.7871 3.918 1.477 0.140 -1.895 13.469
brand_name_Gionee -6.0985 2.507 -2.433 0.015 -11.014 -1.183
brand_name_Google 11.3999 4.250 2.682 0.007 3.066 19.733
brand_name_HTC -1.4551 1.906 -0.764 0.445 -5.192 2.282
brand_name_Honor 1.2314 1.851 0.665 0.506 -2.398 4.861
brand_name_Infinix -15.7402 5.903 -2.666 0.008 -27.315 -4.165
brand_name_Karbonn -2.1471 3.165 -0.678 0.498 -8.353 4.058
brand_name_LG 0.9271 1.523 0.609 0.543 -2.060 3.915
brand_name_Lava -2.1200 2.936 -0.722 0.470 -7.877 3.637
brand_name_Lenovo -2.5333 1.668 -1.519 0.129 -5.804 0.737
brand_name_Meizu -1.4981 2.377 -0.630 0.529 -6.159 3.162
brand_name_Micromax 0.4016 1.884 0.213 0.831 -3.293 4.096
brand_name_Microsoft -4.2430 4.448 -0.954 0.340 -12.966 4.480
brand_name_Motorola -1.6030 1.821 -0.880 0.379 -5.173 1.967
brand_name_Nokia -8.2923 1.965 -4.220 0.000 -12.146 -4.439
brand_name_OnePlus -17.7263 3.643 -4.865 0.000 -24.871 -10.582
brand_name_Oppo -3.4645 1.773 -1.954 0.051 -6.942 0.013
brand_name_Others -0.3857 1.261 -0.306 0.760 -2.857 2.086
brand_name_Panasonic 0.5177 2.604 0.199 0.842 -4.589 5.625
brand_name_Realme -3.5214 3.026 -1.164 0.245 -9.456 2.413
brand_name_Samsung -0.2401 1.339 -0.179 0.858 -2.867 2.386
brand_name_Sony 0.2116 2.138 0.099 0.921 -3.980 4.403
brand_name_Spice 0.9848 3.369 0.292 0.770 -5.621 7.591
brand_name_Vivo 0.0708 1.916 0.037 0.971 -3.687 3.829
brand_name_XOLO 2.0470 2.611 0.784 0.433 -3.073 7.167
brand_name_Xiaomi -2.1578 1.739 -1.241 0.215 -5.568 1.252
brand_name_ZTE -1.3261 1.846 -0.718 0.473 -4.947 2.295
os_Others -3.9891 1.506 -2.649 0.008 -6.942 -1.036
os_Windows 2.4904 2.715 0.917 0.359 -2.834 7.815
os_iOS -15.7768 10.452 -1.509 0.131 -36.272 4.719
4g_yes -2.8586 0.911 -3.138 0.002 -4.645 -1.072
5g_yes 2.1248 1.723 1.233 0.218 -1.253 5.503
==============================================================================
Omnibus: 288.914 Durbin-Watson: 1.983
Prob(Omnibus): 0.000 Jarque-Bera (JB): 656.981
Skew: 0.686 Prob(JB): 2.18e-143
Kurtosis: 5.104 Cond. No. 1.88e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.88e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Observations
# initial list of columns
cols = x_train2.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
# defining the train set
x_train_aux = x_train2[cols]
# fitting the model
model = sm.OLS(y_train, x_train_aux).fit()
# getting the p-values and the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# name of the variable with maximum p-value
feature_with_p_max = p_values.idxmax()
if max_p_value > 0.05:
cols.remove(feature_with_p_max)
else:
break
selected_features = cols
print(selected_features)
['main_camera_mp', 'selfie_camera_mp', 'int_memory', 'ram', 'days_used', 'new_price', 'brand_name_Apple', 'brand_name_Gionee', 'brand_name_Google', 'brand_name_Infinix', 'brand_name_Nokia', 'brand_name_OnePlus', 'os_Others', '4g_yes']
x_train3 = x_train2[selected_features]
x_test3 = x_test2[selected_features]
olsmod2 = sm.OLS(y_train, x_train3).fit()
print(olsmod2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price R-squared: 0.955
Model: OLS Adj. R-squared: 0.954
Method: Least Squares F-statistic: 4016.
Date: Tue, 17 Aug 2021 Prob (F-statistic): 0.00
Time: 13:15:39 Log-Likelihood: -10150.
No. Observations: 2499 AIC: 2.033e+04
Df Residuals: 2485 BIC: 2.041e+04
Df Model: 13
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
main_camera_mp -0.2818 0.083 -3.410 0.001 -0.444 -0.120
selfie_camera_mp 0.7872 0.089 8.893 0.000 0.614 0.961
int_memory 0.0849 0.010 8.541 0.000 0.065 0.104
ram 15.9125 0.349 45.652 0.000 15.229 16.596
days_used -0.0850 0.001 -58.218 0.000 -0.088 -0.082
new_price 0.3874 0.003 137.089 0.000 0.382 0.393
brand_name_Apple 7.2813 2.370 3.072 0.002 2.634 11.929
brand_name_Gionee -5.6479 2.311 -2.444 0.015 -10.180 -1.116
brand_name_Google 10.8801 4.112 2.646 0.008 2.817 18.943
brand_name_Infinix -15.1234 5.795 -2.610 0.009 -26.486 -3.761
brand_name_Nokia -7.6950 1.625 -4.735 0.000 -10.882 -4.508
brand_name_OnePlus -16.4609 3.514 -4.685 0.000 -23.351 -9.571
os_Others -3.9859 1.319 -3.023 0.003 -6.571 -1.400
4g_yes -3.0030 0.796 -3.772 0.000 -4.564 -1.442
==============================================================================
Omnibus: 297.372 Durbin-Watson: 1.970
Prob(Omnibus): 0.000 Jarque-Bera (JB): 679.906
Skew: 0.702 Prob(JB): 2.29e-148
Kurtosis: 5.135 Cond. No. 1.54e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.54e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Now that no feature has a p-value greater than 0.05, we will consider the features in x_train3 as the final ones and olsmod2 as the final model.
Observation
Now we'll check the rest of the assumptions on olsmod2.
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
How to check linearity and independence?
# let's create a dataframe with actual, fitted and residual values
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train # actual values
df_pred["Fitted Values"] = olsmod2.fittedvalues # predicted values
df_pred["Residuals"] = olsmod2.resid # residuals
df_pred.head()
| Actual Values | Fitted Values | Residuals | |
|---|---|---|---|
| 844 | 100.48 | 99.945686 | 0.534314 |
| 1539 | 111.68 | 117.259896 | -5.579896 |
| 3452 | 113.89 | 112.268446 | 1.621554 |
| 1727 | 64.09 | 70.591034 | -6.501034 |
| 1926 | 67.95 | 68.430949 | -0.480949 |
# let's plot the fitted values vs residuals
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
Why the test?
How to check normality?
How to fix if this assumption is not followed?
sns.histplot(data=df_pred, x="Residuals", kde=True)
plt.title("Normality of residuals")
plt.show()
import pylab
import scipy.stats as stats
stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab)
plt.show()
stats.shapiro(df_pred["Residuals"])
ShapiroResult(statistic=0.9610329270362854, pvalue=2.134022823018388e-25)
Since the p-value < 0.05, the Shapiro-Wilk test rejects strict normality. However, with a sample this large the test is very sensitive to small departures, and the histogram and Q-Q plot above suggest the residuals are close to normal.
Homoscedasticity: If the variance of the residuals is symmetrically distributed across the regression line, then the data is said to be homoscedastic.
Heteroscedasticity: If the variance of the residuals is unequal across the regression line, then the data is said to be heteroscedastic.
Why the test?
How to check for homoscedasticity?
How to fix if this assumption is not followed?
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train3)
lzip(name, test)
[('F statistic', 1.0487773629610773), ('p-value', 0.20135126405844322)]
Since p-value > 0.05, we can say that the residuals are homoscedastic. So, this assumption is satisfied.
Now that we have checked the assumptions of linear regression and found them reasonably satisfied, we can move on to making predictions.
# predictions on the test set
pred = olsmod2.predict(x_test3)
df_pred_test = pd.DataFrame({"Actual": y_test, "Predicted": pred})
df_pred_test.sample(10, random_state=1)
| Actual | Predicted | |
|---|---|---|
| 2098 | 30.5200 | 23.079415 |
| 278 | 195.6700 | 188.682824 |
| 26 | 247.1925 | 224.762738 |
| 2910 | 89.9700 | 91.279577 |
| 2631 | 69.2000 | 63.993617 |
| 1582 | 89.5800 | 108.524466 |
| 2110 | 247.1925 | 263.744572 |
| 3160 | 65.3400 | 65.515000 |
| 2817 | 115.7700 | 106.128341 |
| 549 | 39.2900 | 48.311867 |
We can observe that the model returns fairly good predictions: the actual and predicted values are comparable.
We can also visualize the comparison as a bar graph.
Note: As the number of records is large, for representation purpose, we are taking a sample of 25 records only.
df1 = df_pred_test.sample(25, random_state=1)
df1.plot(kind="bar", figsize=(15, 7))
plt.show()
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmod2_train_perf = model_performance_regression(olsmod2, x_train3, y_train)
olsmod2_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 14.04901 | 10.278116 | 0.954564 | 0.954308 | 18.665149 |
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmod2_test_perf = model_performance_regression(olsmod2, x_test3, y_test)
olsmod2_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 13.722107 | 10.109717 | 0.957586 | 0.957025 | 16.300215 |
Let's compare the initial model created with sklearn and the final statsmodels model.
# training performance comparison
models_train_comp_df = pd.concat(
[linearregression_train_perf.T, olsmod2_train_perf.T], axis=1,
)
models_train_comp_df.columns = [
"Linear Regression sklearn",
"Linear Regression statsmodels",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Linear Regression sklearn | Linear Regression statsmodels | |
|---|---|---|
| RMSE | 13.960441 | 14.049010 |
| MAE | 10.222224 | 10.278116 |
| R-squared | 0.955136 | 0.954564 |
| Adj. R-squared | 0.954257 | 0.954308 |
| MAPE | 18.489055 | 18.665149 |
# test performance comparison
models_test_comp_df = pd.concat(
[linearregression_test_perf.T, olsmod2_test_perf.T], axis=1,
)
models_test_comp_df.columns = [
"Linear Regression sklearn",
"Linear Regression statsmodels",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| Linear Regression sklearn | Linear Regression statsmodels | |
|---|---|---|
| RMSE | 13.745320 | 13.722107 |
| MAE | 10.171443 | 10.109717 |
| R-squared | 0.957443 | 0.957586 |
| Adj. R-squared | 0.955446 | 0.957025 |
| MAPE | 16.417574 | 16.300215 |
Let's recreate the final statsmodels model and print its summary to gain insights.
olsmodel_final = sm.OLS(y_train, x_train3).fit()
print(olsmodel_final.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price R-squared: 0.955
Model: OLS Adj. R-squared: 0.954
Method: Least Squares F-statistic: 4016.
Date: Tue, 17 Aug 2021 Prob (F-statistic): 0.00
Time: 13:18:00 Log-Likelihood: -10150.
No. Observations: 2499 AIC: 2.033e+04
Df Residuals: 2485 BIC: 2.041e+04
Df Model: 13
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
main_camera_mp -0.2818 0.083 -3.410 0.001 -0.444 -0.120
selfie_camera_mp 0.7872 0.089 8.893 0.000 0.614 0.961
int_memory 0.0849 0.010 8.541 0.000 0.065 0.104
ram 15.9125 0.349 45.652 0.000 15.229 16.596
days_used -0.0850 0.001 -58.218 0.000 -0.088 -0.082
new_price 0.3874 0.003 137.089 0.000 0.382 0.393
brand_name_Apple 7.2813 2.370 3.072 0.002 2.634 11.929
brand_name_Gionee -5.6479 2.311 -2.444 0.015 -10.180 -1.116
brand_name_Google 10.8801 4.112 2.646 0.008 2.817 18.943
brand_name_Infinix -15.1234 5.795 -2.610 0.009 -26.486 -3.761
brand_name_Nokia -7.6950 1.625 -4.735 0.000 -10.882 -4.508
brand_name_OnePlus -16.4609 3.514 -4.685 0.000 -23.351 -9.571
os_Others -3.9859 1.319 -3.023 0.003 -6.571 -1.400
4g_yes -3.0030 0.796 -3.772 0.000 -4.564 -1.442
==============================================================================
Omnibus: 297.372 Durbin-Watson: 1.970
Prob(Omnibus): 0.000 Jarque-Bera (JB): 679.906
Skew: 0.702 Prob(JB): 2.29e-148
Kurtosis: 5.135 Cond. No. 1.54e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.54e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Factors with a positive impact on used price: ram, selfie_camera_mp, int_memory, new_price, and the Apple and Google brands. As these factors increase (or, for brands, apply), the predicted used price increases.
ram stands out as particularly significant: each additional GB of RAM adds about 15.9 units to the predicted used price, holding other factors constant.
Among brands, Apple and Google have the most positive impact on used prices, while Nokia, Gionee, OnePlus, and Infinix have a negative impact. Infinix and OnePlus show the strongest negative effects, with coefficients of about -15 and -16, meaning a phone from one of these brands is predicted to sell for roughly 15 to 16 USD less, all else being equal.
Recommendation